Hi everybody!
Description:

Objective:
Countries can be categorized based on socioeconomic and health indicators that measure a nation's overall level of development. Key factors include economic prosperity, education, healthcare access, and quality of life. By analyzing metrics in areas like income, literacy, life expectancy, and standards of living, countries can be grouped and compared in terms of their developmental progress. This allows for identifying development gaps and priorities for improvement across different country income levels and regions. Examining socioeconomic and health criteria provides insights into a nation's human development capacity and areas requiring policy attention.

Problem:
HELP International is a dedicated international humanitarian NGO focused on combating poverty and ensuring access to essential resources and relief in underprivileged nations, particularly in times of disasters and natural calamities. Our organization actively implements various operational projects and engages in advocacy efforts to increase awareness and secure funding for our cause.

Goal:
HELP International has raised $10 million that its CEO now needs to strategically allocate to countries most in need of aid. As a data scientist, my job is to analyze key socioeconomic and health factors to categorize countries by level of development. This will allow me to identify countries facing the biggest development challenges and requiring the most attention and aid money. By examining metrics like income distribution, poverty rates, education levels, and healthcare access and outcomes, I can determine where people lack basic necessities and opportunities. My analysis aims to highlight countries with the lowest scores on human development indexes so the CEO can focus aid efforts on those nations facing the most pressing and profound development needs. Effective use of data on socioeconomic and health disparities will allow me to advise on directing the aid money in a way that maximizes positive impact on vulnerable populations worldwide.

If you don't have the required libraries, install them first (e.g. pip install pandas).
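All of the dependencies imported in the next cell can be installed in one command (these are the PyPI package names for the libraries used below; versions are left unpinned):

```shell
pip install pandas numpy matplotlib seaborn plotly scikit-learn kneed
```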
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from kneed import KneeLocator
import warnings
In [2]:
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.style.use('ggplot')
Here we load the data from a CSV file using the pandas library.
In [3]:
data = pd.read_csv('Country-data.csv')
df = pd.DataFrame(data)
df
Out[3]:
| country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 |
| 1 | Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 |
| 2 | Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 |
| 3 | Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 |
| 4 | Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 162 | Vanuatu | 29.2 | 46.6 | 5.25 | 52.7 | 2950 | 2.62 | 63.0 | 3.50 | 2970 |
| 163 | Venezuela | 17.1 | 28.5 | 4.91 | 17.6 | 16500 | 45.90 | 75.4 | 2.47 | 13500 |
| 164 | Vietnam | 23.3 | 72.0 | 6.84 | 80.2 | 4490 | 12.10 | 73.1 | 1.95 | 1310 |
| 165 | Yemen | 56.3 | 30.0 | 5.18 | 34.4 | 4480 | 23.60 | 67.5 | 4.67 | 1310 |
| 166 | Zambia | 83.1 | 37.0 | 5.89 | 30.9 | 3280 | 14.00 | 52.0 | 5.40 | 1460 |
167 rows × 10 columns
.shape returns the row and column count of the DataFrame as a tuple.
.head() returns the first rows of the DataFrame, convenient for previewing the start of the data.
.info() prints a concise summary of the DataFrame: column names, dtypes, non-null counts, and memory usage.
.describe(include='all') provides a high-level statistical summary of both numeric and object columns.
In [4]:
df.shape
Out[4]:
(167, 10)
In [5]:
df.head()
Out[5]:
| country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 |
| 1 | Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 |
| 2 | Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 |
| 3 | Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 |
| 4 | Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 |
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   country     167 non-null    object
 1   child_mort  167 non-null    float64
 2   exports     167 non-null    float64
 3   health      167 non-null    float64
 4   imports     167 non-null    float64
 5   income      167 non-null    int64
 6   inflation   167 non-null    float64
 7   life_expec  167 non-null    float64
 8   total_fer   167 non-null    float64
 9   gdpp        167 non-null    int64
dtypes: float64(7), int64(2), object(1)
memory usage: 13.2+ KB
There are 167 samples, with no null values.
In [7]:
df.describe(include='all')
Out[7]:
| country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 167 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 |
| unique | 167 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | Afghanistan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | 38.270060 | 41.108976 | 6.815689 | 46.890215 | 17144.688623 | 7.781832 | 70.555689 | 2.947964 | 12964.155689 |
| std | NaN | 40.328931 | 27.412010 | 2.746837 | 24.209589 | 19278.067698 | 10.570704 | 8.893172 | 1.513848 | 18328.704809 |
| min | NaN | 2.600000 | 0.109000 | 1.810000 | 0.065900 | 609.000000 | -4.210000 | 32.100000 | 1.150000 | 231.000000 |
| 25% | NaN | 8.250000 | 23.800000 | 4.920000 | 30.200000 | 3355.000000 | 1.810000 | 65.300000 | 1.795000 | 1330.000000 |
| 50% | NaN | 19.300000 | 35.000000 | 6.320000 | 43.300000 | 9960.000000 | 5.390000 | 73.100000 | 2.410000 | 4660.000000 |
| 75% | NaN | 62.100000 | 51.350000 | 8.600000 | 58.750000 | 22800.000000 | 10.750000 | 76.800000 | 3.880000 | 14050.000000 |
| max | NaN | 208.000000 | 200.000000 | 17.900000 | 174.000000 | 125000.000000 | 104.000000 | 82.800000 | 7.490000 | 105000.000000 |
In [8]:
df.isnull().sum()
Out[8]:
country       0
child_mort    0
exports       0
health        0
imports       0
income        0
inflation     0
life_expec    0
total_fer     0
gdpp          0
dtype: int64
There are no missing values in the DataFrame.
In [9]:
df_duplicated = df.copy()
df_duplicated.drop_duplicates(subset=None, inplace=True)
df_duplicated.shape
Out[9]:
(167, 10)
In [10]:
df.shape
Out[10]:
(167, 10)
There are no duplicated rows in the DataFrame.
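The copy-and-drop check above can also be done in one step with `DataFrame.duplicated`; a minimal sketch on toy data (the toy frame is illustrative, not from the country dataset):

```python
import pandas as pd

# Toy frame standing in for the country data; the last row repeats the second
toy = pd.DataFrame({"country": ["A", "B", "B"], "gdpp": [100, 200, 200]})

n_dupes = toy.duplicated().sum()  # rows identical to an earlier row
print(n_dupes)                    # 1 duplicate in the toy frame
```

On the real df, `df.duplicated().sum()` returning 0 confirms the same conclusion without copying the frame.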
In [11]:
df[df['country'] == 'Belarus']
Out[11]:
| country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | |
|---|---|---|---|---|---|---|---|---|---|---|
| 14 | Belarus | 5.5 | 51.4 | 5.61 | 64.5 | 16200 | 15.1 | 70.4 | 1.49 | 6030 |
In [12]:
df[df['country'] == 'Austria']
Out[12]:
| country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | |
|---|---|---|---|---|---|---|---|---|---|---|
| 8 | Austria | 4.3 | 51.3 | 11.0 | 47.8 | 43200 | 0.873 | 80.5 | 1.44 | 46900 |
In [13]:
# exports, imports and health are given as % of gdpp; convert them to absolute per-capita values
df['exports'] = df['exports'] * df['gdpp'] / 100
df['imports'] = df['imports'] * df['gdpp'] / 100
df['health'] = df['health'] * df['gdpp'] / 100
In [14]:
df.head()
Out[14]:
| country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 90.2 | 55.30 | 41.9174 | 248.297 | 1610 | 9.44 | 56.2 | 5.82 | 553 |
| 1 | Albania | 16.6 | 1145.20 | 267.8950 | 1987.740 | 9930 | 4.49 | 76.3 | 1.65 | 4090 |
| 2 | Algeria | 27.3 | 1712.64 | 185.9820 | 1400.440 | 12900 | 16.10 | 76.5 | 2.89 | 4460 |
| 3 | Angola | 119.0 | 2199.19 | 100.6050 | 1514.370 | 5900 | 22.40 | 60.1 | 6.16 | 3530 |
| 4 | Antigua and Barbuda | 10.3 | 5551.00 | 735.6600 | 7185.800 | 19100 | 1.44 | 76.8 | 2.13 | 12200 |
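A quick sanity check on the conversion: in the original data Afghanistan exported 10.0% of a gdpp of 553, which matches the 55.30 shown in the converted table above:

```python
# Afghanistan's original values: exports as a percentage of gdpp
exports_pct, gdpp = 10.0, 553
exports_abs = exports_pct * gdpp / 100  # dollars per capita
print(exports_abs)  # 55.3
```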
In [15]:
df.describe()
Out[15]:
| child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | |
|---|---|---|---|---|---|---|---|---|---|
| count | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 |
| mean | 38.270060 | 7420.618847 | 1056.733204 | 6588.352108 | 17144.688623 | 7.781832 | 70.555689 | 2.947964 | 12964.155689 |
| std | 40.328931 | 17973.885795 | 1801.408906 | 14710.810418 | 19278.067698 | 10.570704 | 8.893172 | 1.513848 | 18328.704809 |
| min | 2.600000 | 1.076920 | 12.821200 | 0.651092 | 609.000000 | -4.210000 | 32.100000 | 1.150000 | 231.000000 |
| 25% | 8.250000 | 447.140000 | 78.535500 | 640.215000 | 3355.000000 | 1.810000 | 65.300000 | 1.795000 | 1330.000000 |
| 50% | 19.300000 | 1777.440000 | 321.886000 | 2045.580000 | 9960.000000 | 5.390000 | 73.100000 | 2.410000 | 4660.000000 |
| 75% | 62.100000 | 7278.000000 | 976.940000 | 7719.600000 | 22800.000000 | 10.750000 | 76.800000 | 3.880000 | 14050.000000 |
| max | 208.000000 | 183750.000000 | 8663.600000 | 149100.000000 | 125000.000000 | 104.000000 | 82.800000 | 7.490000 | 105000.000000 |
In [16]:
df.hist(bins=100,figsize=(30,20))
plt.show()
In [17]:
plt.figure(figsize=(9, 7))
sns.heatmap(data=df.iloc[:, 1:].corr(), annot=True, fmt=".2f", linewidth=0.75, cmap='Reds')
plt.show()
The country column is excluded from the correlation heatmap and the density plots below, since it is a non-numeric identifier.
In [18]:
density_features=df.columns[1:]
density_features
Out[18]:
Index(['child_mort', 'exports', 'health', 'imports', 'income', 'inflation',
'life_expec', 'total_fer', 'gdpp'],
dtype='object')
In [19]:
for i in enumerate(density_features):
print(i)
(0, 'child_mort')
(1, 'exports')
(2, 'health')
(3, 'imports')
(4, 'income')
(5, 'inflation')
(6, 'life_expec')
(7, 'total_fer')
(8, 'gdpp')
Visualization of each density feature with sns.histplot(data, stat='density', kde=True) to inspect the data distributions.
In [20]:
plt.figure(figsize=(22, 20))
for i, feature in enumerate(density_features):
plt.subplot(5, 2, i+1)
sns.histplot(df[feature], stat="density", kde=True)
plt.title(f"Distribution of {feature}")
plt.tight_layout()
plt.show()
In [21]:
child_mort = df[['country','child_mort']].sort_values('child_mort', ascending=False)
fig = px.bar(child_mort, x='country', y='child_mort', color='country', labels={'country': 'Country Name', 'child_mort': 'Child Mortality Rate'})
fig.update_layout(title = 'Death of children under 5 years of age per 1000 live births',
title_x=0.4,
xaxis={'tickangle': 270},
width=1000, height=600)
fig.show()
In [22]:
export = df[['country','exports']].sort_values('exports', ascending=False)
fig = px.bar(export, x='country', y='exports', color='country', labels={'country': 'Country Name', 'exports': 'Export'})
fig.update_layout(title = 'Exports of goods and services',
title_x=0.4,
xaxis={'tickangle': 270},
width=1000, height=600)
fig.show()
In [23]:
health = df[['country','health']].sort_values('health', ascending=False)
fig = px.bar(health, x='country', y='health', color='country', labels={'country': 'Country Name', 'health': 'Health'})
fig.update_layout(title = 'Total health spending',
title_x=0.4,
xaxis={'tickangle': 270},
width=1000, height=600)
fig.show()
In [24]:
imports = df[['country','imports']].sort_values('imports', ascending=False)
fig = px.bar(imports, x='country', y='imports', color='country', labels={'country': 'Country Name', 'imports': 'Import'})
fig.update_layout(title = 'Imports of goods and services',
                  title_x=0.4,
                  xaxis={'tickangle': 270},
                  width=1000, height=600)
fig.show()
In [25]:
plt.figure(figsize=(24, 20))
for i, feature in enumerate(density_features):
    plt.subplot(5, 2, i+1)
    sns.boxplot(x=feature, data=df)
plt.show()
Based on the boxplots, we can make the following inferences and outline the required outlier treatment:

1. Child mortality: outliers on the higher end indicate countries with high child mortality rates. These countries are potential aid targets, so no treatment is necessary.
2. Exports: many outliers on the higher end; cap them at the 95th-percentile value.
3. Health: similarly, high-end outliers in health expenditure; cap at the 95th percentile.
4. Imports: many countries depend heavily on imports, producing high-end outliers; cap at the 95th percentile.
5. Income: numerous high-end outliers. Since the focus is on identifying low-income countries for aid, capping at the 95th percentile is appropriate.
6. Inflation: high-end outliers exist; cap at the 95th percentile.
7. Life expectancy: some outliers on the lower end; cap at the 5th-percentile value.
8. Total fertility: a single high-end outlier; cap at the 95th percentile.
9. GDP per capita (gdpp): many high-end outliers. Since the goal is to identify low-gdpp countries, cap at the 95th percentile.

Note: with only 167 rows, dropping records risks excluding countries in need of aid, so capping the outliers as described above is the more appropriate approach.
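The per-column capping performed in the next cells can also be written compactly with pandas' `Series.clip`, which sidesteps chained-indexing assignment entirely; a minimal sketch on toy data:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])   # toy column with one high outlier
upper = s.quantile(0.95)            # 95th-percentile cap
capped = s.clip(upper=upper)        # values above the cap are replaced by it
print(capped.max() <= upper)        # True
```

For a lower-end cap (as used for life_expec), pass `lower=` instead of `upper=`.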
In [26]:
# calculate the 95th-percentile values for exports, imports, health, income, inflation, total_fer, gdpp
q4_exports = df['exports'].quantile(.95)
q4_imports = df['imports'].quantile(.95)
q4_health = df['health'].quantile(.95)
q4_income = df['income'].quantile(.95)
q4_inflation = df['inflation'].quantile(.95)
q4_total_fer = df['total_fer'].quantile(.95)
q4_gdpp = df['gdpp'].quantile(.95)
# calculate the 5th-percentile value for life_expec
q1_life_expec = df['life_expec'].quantile(.05)
In [27]:
# perform outlier capping (.loc avoids the chained-assignment pitfall)
df.loc[df['exports'] >= q4_exports, 'exports'] = q4_exports
df.loc[df['imports'] >= q4_imports, 'imports'] = q4_imports
df.loc[df['health'] >= q4_health, 'health'] = q4_health
df.loc[df['income'] >= q4_income, 'income'] = q4_income
df.loc[df['inflation'] >= q4_inflation, 'inflation'] = q4_inflation
df.loc[df['total_fer'] >= q4_total_fer, 'total_fer'] = q4_total_fer
df.loc[df['gdpp'] >= q4_gdpp, 'gdpp'] = q4_gdpp
df.loc[df['life_expec'] <= q1_life_expec, 'life_expec'] = q1_life_expec
Recheck the boxplots to confirm the outliers have been capped.
In [28]:
plt.figure(figsize=(24, 20))
for i, feature in enumerate(density_features):
    plt.subplot(5, 2, i+1)
    sns.boxplot(x=feature, data=df)
plt.show()
In [29]:
df.describe()
Out[29]:
| child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | |
|---|---|---|---|---|---|---|---|---|---|
| count | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 |
| mean | 38.270060 | 5783.114656 | 954.012480 | 5140.089474 | 15738.760479 | 6.929317 | 70.863593 | 2.917479 | 11998.826347 |
| std | 40.328931 | 8580.053847 | 1453.426636 | 6696.210005 | 14787.154215 | 6.384357 | 8.085376 | 1.443771 | 15158.213199 |
| min | 2.600000 | 1.076920 | 12.821200 | 0.651092 | 609.000000 | -4.210000 | 55.780000 | 1.150000 | 231.000000 |
| 25% | 8.250000 | 447.140000 | 78.535500 | 640.215000 | 3355.000000 | 1.810000 | 65.300000 | 1.795000 | 1330.000000 |
| 50% | 19.300000 | 1777.440000 | 321.886000 | 2045.580000 | 9960.000000 | 5.390000 | 73.100000 | 2.410000 | 4660.000000 |
| 75% | 62.100000 | 7278.000000 | 976.940000 | 7719.600000 | 22800.000000 | 10.750000 | 76.800000 | 3.880000 | 14050.000000 |
| max | 208.000000 | 31385.100000 | 4966.701000 | 24241.560000 | 48290.000000 | 20.870000 | 82.800000 | 5.861000 | 48610.000000 |
MinMaxScaler() rescales each feature into a fixed range, typically [0, 1]. This normalization reduces the influence of features with large absolute values on distance-based algorithms.
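Per column, MinMaxScaler computes x' = (x − min) / (max − min); a minimal check on a one-feature toy array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [3.0], [5.0]])       # one feature: min=1, max=5
scaled = MinMaxScaler().fit_transform(X)  # (x - 1) / (5 - 1)
print(scaled.ravel())                     # [0.  0.5 1. ]
```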
In [30]:
country = df.country
df.drop(columns='country', inplace=True)
# Normalize
scaler = MinMaxScaler().fit_transform(df)
norm_df = pd.DataFrame(scaler, columns=df.columns)
PCA (Principal Component Analysis) is a technique that reduces the number of features in a dataset while preserving important patterns. It achieves this by transforming the original features into a new set of uncorrelated variables called principal components. This simplification improves computational efficiency, reduces noise, and aids in data visualization and interpretation.
In [31]:
pca = PCA(n_components=9).fit(norm_df)
exp = pca.explained_variance_ratio_
print(exp)
[0.70672198 0.13776226 0.07858595 0.03255028 0.01867945 0.01325302 0.00823535 0.00279766 0.00141406]
In [32]:
plt.plot(np.cumsum(exp), linewidth=2, marker = 'o', linestyle = '--')
plt.title("PCA", fontsize=20)
plt.xlabel('n_component')
plt.ylabel('Variance Ratio')
plt.yticks(np.arange(0.55, 1.05, 0.05))
plt.show()
With n_components=5, the cumulative explained variance ratio exceeds 95% (about 97.4%).
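Rather than eyeballing the elbow plot, the smallest component count reaching a variance threshold can be computed directly from the ratios printed above. Note that 4 components already clear 95%; keeping 5, as this notebook does, adds a little extra margin (~97.4%):

```python
import numpy as np

# Explained-variance ratios printed by the PCA fit above
exp = np.array([0.70672198, 0.13776226, 0.07858595, 0.03255028,
                0.01867945, 0.01325302, 0.00823535, 0.00279766, 0.00141406])
cum = np.cumsum(exp)
smallest_95 = int(np.argmax(cum >= 0.95)) + 1  # first count whose cumsum reaches 95%
print(smallest_95, round(cum[4], 4))           # 4 0.9743
```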
In [33]:
final_pca = IncrementalPCA(n_components=5).fit_transform(norm_df)
In [34]:
final_pca.shape
Out[34]:
(167, 5)
In [35]:
pc = np.transpose(final_pca)
In [36]:
corr_pca = np.corrcoef(pc)
In [37]:
sns.heatmap(data=corr_pca, annot=True, fmt=".2f", linewidth=0.75, cmap="Reds")
plt.show()
In [38]:
pca_df = pd.DataFrame({
'PC1':pc[0],
'PC2':pc[1],
'PC3':pc[2],
'PC4':pc[3],
'PC5':pc[4],
})
pca_df
Out[38]:
| PC1 | PC2 | PC3 | PC4 | PC5 | |
|---|---|---|---|---|---|
| 0 | -0.868026 | 0.468154 | -0.116122 | 0.015084 | 0.030630 |
| 1 | -0.075952 | -0.471567 | -0.026781 | -0.007173 | 0.019178 |
| 2 | -0.214388 | -0.201541 | 0.392519 | 0.091286 | 0.133112 |
| 3 | -0.807603 | 0.610289 | 0.331504 | 0.069905 | 0.046915 |
| 4 | 0.231294 | -0.264191 | -0.118967 | -0.066490 | 0.102013 |
| ... | ... | ... | ... | ... | ... |
| 162 | -0.453124 | -0.050333 | -0.243497 | -0.067826 | -0.040117 |
| 163 | -0.038596 | -0.112832 | 0.580115 | 0.142696 | 0.001215 |
| 164 | -0.289604 | -0.367720 | 0.210735 | 0.004976 | -0.049341 |
| 165 | -0.641633 | 0.186166 | 0.429498 | 0.124173 | 0.116803 |
| 166 | -0.832937 | 0.457908 | 0.070357 | 0.016176 | -0.056084 |
167 rows × 5 columns
Boxplots of the principal components reveal the outliers in each component.
In [39]:
fig, ax = plt.subplots(figsize=(15,6))
sns.boxplot(data=pca_df)
plt.show()
Find the best n_clusters for KMeans.
In [40]:
kmeans_list = []
kmeans_sil_coef = []
kmeans_calinski_score = []
kmeans_davies_score = []
for n in range(2, 10) :
# main algorithm
kmeans = KMeans(n_clusters=n).fit(df)
kmeans_list.append(kmeans.inertia_)
# Silhouette
sil_score = silhouette_score(df, kmeans.labels_)
kmeans_sil_coef.append(sil_score)
# Calinski Harabasz Score
calinski_score = calinski_harabasz_score(df, kmeans.labels_)
kmeans_calinski_score.append(calinski_score)
# Davies Bouldin Score
davies_score = davies_bouldin_score(df, kmeans.labels_)
kmeans_davies_score.append(davies_score)
k1 = KneeLocator(range(2, 10), kmeans_list, curve='convex', direction='decreasing')
In [41]:
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
# Ax 1 : inertia
ax[0, 0].plot(range(2, 10), kmeans_list, label='Inertia', linewidth=2)
ax[0, 0].axvline(x=k1.elbow, ls='--', label='k1 elbow', color='purple', alpha=0.65)
ax[0, 0].set_title('Inertia', fontsize=20, fontweight=600, color='Green')
ax[0, 0].legend()
# Ax 2 : Silhouette Score
ax[0, 1].plot(range(2, 10), kmeans_sil_coef, label='Silhouette Score')
ax[0, 1].set_title('Silhouette Score', fontsize=20, fontweight=600, color='Green')
ax[0, 1].axvline(x=3, ls='--', alpha=0.65, label='Best n', color='Blue')
ax[0, 1].legend()
# Ax 3 : Calinski Harabasz Score
ax[1, 0].plot(range(2, 10), kmeans_calinski_score, label='Calinski Harabasz Score')
ax[1, 0].set_title('Calinski Harabasz Score', fontsize=20, fontweight=600, color='Green')
ax[1, 0].axvline(x=2, ls='--', label='Best n', color='blue', alpha=0.65)
ax[1, 0].axvline(x=3, ls='--', label='Best 2 n', color='purple', alpha=0.65)
ax[1, 0].legend()
# Ax 4 : Davies Bouldin Score
ax[1, 1].plot(range(2, 10), kmeans_davies_score, label='Davies Bouldin Score')
ax[1, 1].set_title('Davies Bouldin Score', fontsize=20, fontweight=600, color='Green')
ax[1, 1].axvline(x=4, ls='--', alpha=0.65, label='Best n', color='Blue')
ax[1, 1].axvline(x=2, ls='--', alpha=0.65, label='second Best', color='purple')
ax[1, 1].legend()
plt.show()
According to the plots above, the best n_clusters for the KMeans algorithm is 3.
In [42]:
kmeans = KMeans(n_clusters=3).fit(pca_df)
In [43]:
# add country column
pca_df.insert(0, 'Country', country)
pca_df['class'] = kmeans.labels_
In [44]:
pca_df['Requirement'] = pca_df['class']
In [45]:
rich = int(pca_df.loc[pca_df.Country == 'Canada', 'class'].iloc[0])
middle = int(pca_df.loc[pca_df.Country == 'Iran', 'class'].iloc[0])
poor = int(pca_df.loc[pca_df.Country == 'Afghanistan', 'class'].iloc[0])
In [46]:
rich_label = 'Rich countries'
middle_label = 'Middle countries'
poor_label = 'Poor countries'
In [47]:
pca_df.replace({'Requirement':{rich:rich_label, middle:middle_label, poor:poor_label }},inplace=True)
In [48]:
pca_df.head()
Out[48]:
| Country | PC1 | PC2 | PC3 | PC4 | PC5 | class | Requirement | |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | -0.868026 | 0.468154 | -0.116122 | 0.015084 | 0.030630 | 2 | Poor countries |
| 1 | Albania | -0.075952 | -0.471567 | -0.026781 | -0.007173 | 0.019178 | 0 | Middle countries |
| 2 | Algeria | -0.214388 | -0.201541 | 0.392519 | 0.091286 | 0.133112 | 0 | Middle countries |
| 3 | Angola | -0.807603 | 0.610289 | 0.331504 | 0.069905 | 0.046915 | 2 | Poor countries |
| 4 | Antigua and Barbuda | 0.231294 | -0.264191 | -0.118967 | -0.066490 | 0.102013 | 0 | Middle countries |
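The anchor-country mapping above (Canada → rich, Iran → middle, Afghanistan → poor) assumes those three countries land in three distinct clusters. A sketch of an anchor-free alternative: rank clusters by their mean gdpp and name them in that order (the frame below is toy data, not the real clustering output):

```python
import pandas as pd

# Toy data: gdpp per country plus a KMeans-style cluster label
toy = pd.DataFrame({"gdpp":  [500, 700, 9000, 8000, 45000],
                    "class": [2,   2,   0,    0,    1]})

# Order clusters by mean gdpp, then map them to labels poorest-first
order = toy.groupby("class")["gdpp"].mean().sort_values().index
names = dict(zip(order, ["Poor countries", "Middle countries", "Rich countries"]))
toy["Requirement"] = toy["class"].map(names)
print(toy["Requirement"].tolist())
```

This generalizes if an anchor country ever lands in an unexpected cluster.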
In [49]:
fig = px.choropleth(pca_df[['Country','class']],
locationmode = 'country names',
locations = 'Country',
color = pca_df['Requirement'],
color_discrete_map = {'Rich countries': 'Green',
'Middle countries':'LightBlue',
'Poor countries':'Red'}
)
fig.update_layout(
title='World Requirement by Country',
title_font_size=24,
title_x=0.26,
margin = dict(
l=10,
r=0,
b=0,
t=50,
pad=5,
),
)
fig.show()
In [50]:
req_counts = pca_df['Requirement'].value_counts()
fig = px.pie(pca_df, names=req_counts.index, values=req_counts.values,
hole=0.2, width=1100, height=500
)
fig.update_layout(title='Country Requirement Distribution', title_x=0.45, title_font=dict(size=24))
fig.show()